week 6: integers and other monsters

ordered categories

Ordered categorical outcomes

Ordered categories are variables that are discrete, like a count, except that the values merely indicate different ordered levels.1 Most Likert scale questions are ordered categories, even though we pretend they’re continuous (see polychoric correlations).

If we want to model an outcome as an ordered categorical variable, what we’re really asking is how moving along an associated predictor shifts predictions progressively through the outcome categories, in sequence. We must therefore model the outcome’s categories in the right order.

To do so, we’ll use the principles of the generalized linear model discussed last week. That is, we use a link function – the CUMULATIVE LINK FUNCTION – to model the probability that the outcome takes on a given value or any smaller value.
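Here’s the cumulative link in miniature, using made-up probabilities and base R only: cumulative probabilities map to log-cumulative-odds (the scale of the cutpoints), and differencing the inverse link recovers the discrete probabilities.

```r
# Toy example of the cumulative link (hypothetical probabilities)
p <- c(.10, .15, .20, .25, .20, .07, .03)  # Pr(response = k), k = 1, ..., 7
cum_p <- cumsum(p)                         # Pr(response <= k)

# log-cumulative-odds: the scale on which the cutpoints alpha_k live;
# the last category's cumulative probability is 1, so it has no cutpoint
alpha <- qlogis(cum_p[1:6])

# differencing the inverse link recovers the discrete probabilities
p_back <- diff(c(0, plogis(alpha), 1))
all.equal(p_back, p)  # TRUE
```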

example: the trolley problem

data(Trolley, package = "rethinking")
trolley <- Trolley
glimpse(trolley)
Rows: 9,930
Columns: 12
$ case      <fct> cfaqu, cfbur, cfrub, cibox, cibur, cispe, fkaqu, fkboa, fkbo…
$ response  <int> 4, 3, 4, 3, 3, 3, 5, 4, 4, 4, 4, 4, 4, 5, 4, 4, 4, 4, 4, 3, …
$ order     <int> 2, 31, 16, 32, 4, 9, 29, 12, 23, 22, 27, 19, 14, 3, 18, 15, …
$ id        <fct> 96;434, 96;434, 96;434, 96;434, 96;434, 96;434, 96;434, 96;4…
$ age       <int> 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, …
$ male      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ edu       <fct> Middle School, Middle School, Middle School, Middle School, …
$ action    <int> 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, …
$ intention <int> 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, …
$ contact   <int> 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ story     <fct> aqu, bur, rub, box, bur, spe, aqu, boa, box, bur, car, spe, …
$ action2   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, …
trolley %>% 
  ggplot(aes( x=response )) +
  geom_histogram(binwidth = 1)  # responses are integers 1-7, so one bin per value
trolley %>%
  count(response) %>% 
  mutate(pr_k     = n/nrow(trolley),
         cum_pr_k = cumsum(pr_k)) %>% 
  ggplot(aes(x = response, y = cum_pr_k)) +
  geom_point(size = 3, color = "#1c5253") +
  geom_line(color = "#1c5253", alpha = .3) + 
  labs(x = "response",
       y = "cumulative probability")
# primary data
trolley_plot <-
  trolley %>%
  count(response) %>%
  mutate(pr_k     = n / nrow(trolley),
         cum_pr_k = cumsum(n / nrow(trolley))) %>% 
  mutate(discrete_probability = ifelse(response == 1, cum_pr_k, cum_pr_k - pr_k))

# annotation
text <-
  tibble(text     = 1:7,
         response = seq(from = 1.25, to = 7.25, by = 1),
         cum_pr_k = trolley_plot$cum_pr_k - .065)

trolley_plot %>% 
  ggplot(aes(x = response, y = cum_pr_k,
             color = cum_pr_k, fill = cum_pr_k)) +
  geom_line() +
  geom_point(shape = 21, colour = "grey92", 
             size = 2.5, stroke = 1) +
  geom_linerange(aes(ymin = 0, ymax = cum_pr_k),
                 alpha = 1/2) +
  geom_linerange(aes(x = response + .025,
                     ymin = ifelse(response == 1, 0, discrete_probability), 
                     ymax = cum_pr_k),
                 color = "black") +
  # number annotation
  geom_text(data = text, 
            aes(label = text),
            size = 4) +
  scale_x_continuous(breaks = 1:7) +
  scale_y_continuous("cumulative proportion", breaks = c(0, .5, 1), 
                     limits = c(0, 1.1)) +
  theme(axis.ticks = element_blank(),
        axis.title.y = element_text(angle = 90),
        legend.position = "none")

On to the modeling. We’ll start with an intercept-only model.

\[\begin{align*} R_i &\sim \text{Categorical}(p) \\ \text{logit}(p_k) &= \alpha_k - \phi_i \\ \phi_i &= 0 \\ \alpha_k &\sim \text{Normal}(0, 1.5) \end{align*}\]

Remember, \(\phi\) doesn’t get its own intercept because there’s an \(\alpha_k\) for each cut point.

m1 <- brm(
  data = trolley,
  family = cumulative, 
  response ~ 1, 
  prior = c( prior(normal(0, 1.5), class = Intercept) ),
  iter=2000, warmup=1000, cores=4, chains=4,
  file=here("files/data/generated_data/m61.1")
)
summary(m1)
 Family: cumulative 
  Links: mu = logit; disc = identity 
Formula: response ~ 1 
   Data: trolley (Number of observations: 9930) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept[1]    -1.92      0.03    -1.98    -1.86 1.00     3004     3065
Intercept[2]    -1.27      0.02    -1.31    -1.22 1.00     4085     3449
Intercept[3]    -0.72      0.02    -0.76    -0.68 1.00     4532     3499
Intercept[4]     0.25      0.02     0.21     0.29 1.00     4697     3614
Intercept[5]     0.89      0.02     0.85     0.93 1.00     4370     3119
Intercept[6]     1.77      0.03     1.71     1.82 1.00     5091     3618

Further Distributional Parameters:
     Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
disc     1.00      0.00     1.00     1.00   NA       NA       NA

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
m1 %>% 
  fixef() %>% 
  inv_logit_scaled() %>% 
  round(digits = 3)
             Estimate Est.Error  Q2.5 Q97.5
Intercept[1]    0.128     0.508 0.122 0.135
Intercept[2]    0.220     0.506 0.212 0.228
Intercept[3]    0.328     0.505 0.318 0.337
Intercept[4]    0.562     0.505 0.552 0.571
Intercept[5]    0.709     0.506 0.700 0.718
Intercept[6]    0.854     0.507 0.847 0.861

Note: the standard errors (Est.Error) are not valid under this approach – the inverse logit was applied to the summary table, not to the draws. (Just look at how they compare to the quantiles of the distribution.) Transforming the posterior draws and then summarizing is the better solution.

as_draws_df(m1) %>% 
  mutate_at(vars(starts_with("b_")), inv_logit_scaled) %>% 
  pivot_longer(starts_with("b_")) %>% 
  group_by(name) %>% 
  summarise(mean = mean(value),
            sd   = sd(value),
            ll   = quantile(value, probs = .025),
            ul   = quantile(value, probs = .975))
# A tibble: 6 × 5
  name            mean      sd    ll    ul
  <chr>          <dbl>   <dbl> <dbl> <dbl>
1 b_Intercept[1] 0.128 0.00338 0.122 0.135
2 b_Intercept[2] 0.220 0.00411 0.212 0.228
3 b_Intercept[3] 0.328 0.00468 0.318 0.337
4 b_Intercept[4] 0.562 0.00499 0.552 0.571
5 b_Intercept[5] 0.709 0.00456 0.700 0.718
6 b_Intercept[6] 0.854 0.00351 0.847 0.861
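
To see what the cutpoints imply on the outcome scale, we can difference the inverse logit of the posterior-mean cutpoints from the summary above. (This is a quick point-estimate sketch, not a substitute for summarizing the draws.)

```r
# posterior-mean cutpoints from summary(m1)
alpha_hat <- c(-1.92, -1.27, -0.72, 0.25, 0.89, 1.77)

# implied probability of each response category 1-7;
# these should sum to 1
p_k <- diff(c(0, plogis(alpha_hat), 1))
round(p_k, 3)
```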

Let’s get our posterior predictive distribution too. (Bars are predicted probabilities, dots are observed).

predicted_draws(m1, newdata = data.frame(1)) %>% 
  count(.prediction) %>%
  mutate(pr_k     = n / sum(n)) %>% 
  ggplot( aes( x=.prediction, y=pr_k ) ) +
  geom_bar( stat="identity", fill = "#1c5253", color = "white", alpha = .5) +
  geom_point(data = trolley_plot, aes(x=response)) +
  labs( x="response" ,
        y="probability")

adding predictors

Now let’s add the features of the story – contact, action, and intention – as predictors in our model. Let’s expand our mathematical model. Here’s our original:

\[\begin{align*} R_i &\sim \text{Categorical}(p) \\ \text{logit}(p_k) &= \alpha_k - \phi_i \\ \phi_i &= 0 \\ \alpha_k &\sim \text{Normal}(0, 1.5) \end{align*}\]

McElreath proposes the following causal formula:

\[ \phi_i = \beta_1\text{action}_i + \beta_2\text{contact}_i + (\beta_3 + \beta_4\text{action}_i + \beta_5\text{contact}_i)\text{intention}_i \]

m2 <- brm(
  data = trolley,
  family = cumulative, 
  response ~ 1 + action + contact + intention + action:intention + contact:intention, 
  prior = c( prior(normal(0, 1.5), class = Intercept),
             prior(normal(0, 0.5), class = b)),
  iter=2000, warmup=1000, cores=4, chains=4,
  file=here("files/data/generated_data/m61.2")
)

m2
 Family: cumulative 
  Links: mu = logit; disc = identity 
Formula: response ~ 1 + action + contact + intention + action:intention + contact:intention 
   Data: trolley (Number of observations: 9930) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
                  Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept[1]         -2.64      0.05    -2.74    -2.53 1.00     2973     2727
Intercept[2]         -1.94      0.05    -2.04    -1.85 1.00     3113     2791
Intercept[3]         -1.35      0.05    -1.44    -1.26 1.00     2979     3029
Intercept[4]         -0.31      0.04    -0.40    -0.23 1.00     2785     2912
Intercept[5]          0.36      0.04     0.27     0.44 1.00     2886     3000
Intercept[6]          1.26      0.05     1.17     1.35 1.00     3129     2816
action               -0.48      0.05    -0.58    -0.37 1.00     3072     2837
contact              -0.35      0.07    -0.48    -0.21 1.00     3031     2820
intention            -0.30      0.06    -0.41    -0.18 1.00     2694     2861
action:intention     -0.43      0.08    -0.59    -0.27 1.00     2949     3105
contact:intention    -1.23      0.10    -1.42    -1.05 1.00     2710     2794

Further Distributional Parameters:
     Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
disc     1.00      0.00     1.00     1.00   NA       NA       NA

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
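
To read these coefficients through McElreath’s formula for \(\phi\), we can plug the posterior means from the summary above into the formula by hand. (A rough point-estimate sketch; the `b` values are copied from the summary table.)

```r
# posterior-mean coefficients from summary(m2)
b <- c(action = -0.48, contact = -0.35, intention = -0.30,
       action_x_intention = -0.43, contact_x_intention = -1.23)

# McElreath's phi, for a story with action A, contact C, intention I
phi <- function(A, C, I) {
  unname(b["action"] * A + b["contact"] * C +
           (b["intention"] + b["action_x_intention"] * A +
              b["contact_x_intention"] * C) * I)
}

# intention combined with contact pushes phi down (toward disapproval) hardest:
phi(A = 0, C = 1, I = 1)  # -0.35 + (-0.30 - 1.23) = -1.88
```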
as_draws_df(m2) %>% 
  select(starts_with("b_")) %>% 
  select(-matches("[0-9]")) %>% 
  pivot_longer(everything()) %>% 
  mutate(
    name = str_remove(name, "b_"),
    name = str_replace(name, ":", " x ")
  ) %>% 
  ggplot( aes(x = value, y = name) ) +
  geom_vline( aes(xintercept = 0), linetype = "dashed", alpha = .3 ) +
  stat_dist_halfeye(fill = "#5e8485") +
  scale_x_continuous("marginal posterior", breaks = -5:0 / 4) +
  scale_y_discrete(NULL) +
  coord_cartesian(xlim = c(-1.4, 0))
nd <- 
  trolley %>% 
  distinct(action, contact, intention) %>% 
  mutate(combination = str_c(action, contact, intention, sep = "_"))

f <- predicted_draws( m2 , newdata = nd )

# what have we done?
f %>% str()
gropd_df [24,000 × 9] (S3: grouped_df/tbl_df/tbl/data.frame)
 $ action     : int [1:24000] 0 0 0 0 0 0 0 0 0 0 ...
 $ contact    : int [1:24000] 1 1 1 1 1 1 1 1 1 1 ...
 $ intention  : int [1:24000] 0 0 0 0 0 0 0 0 0 0 ...
 $ combination: chr [1:24000] "0_1_0" "0_1_0" "0_1_0" "0_1_0" ...
 $ .row       : int [1:24000] 1 1 1 1 1 1 1 1 1 1 ...
 $ .chain     : int [1:24000] NA NA NA NA NA NA NA NA NA NA ...
 $ .iteration : int [1:24000] NA NA NA NA NA NA NA NA NA NA ...
 $ .draw      : int [1:24000] 1 2 3 4 5 6 7 8 9 10 ...
 $ .prediction: Factor w/ 7 levels "1","2","3","4",..: 3 4 4 3 5 6 3 5 6 5 ...
 - attr(*, "groups")= tibble [6 × 6] (S3: tbl_df/tbl/data.frame)
  ..$ action     : int [1:6] 0 0 0 0 1 1
  ..$ contact    : int [1:6] 0 0 1 1 0 0
  ..$ intention  : int [1:6] 0 1 0 1 0 1
  ..$ combination: chr [1:6] "0_0_0" "0_0_1" "0_1_0" "0_1_1" ...
  ..$ .row       : int [1:6] 4 6 1 2 3 5
  ..$ .rows      : list<int> [1:6] 
  .. ..$ : int [1:4000] 12001 12002 12003 12004 12005 12006 12007 12008 12009 12010 ...
  .. ..$ : int [1:4000] 20001 20002 20003 20004 20005 20006 20007 20008 20009 20010 ...
  .. ..$ : int [1:4000] 1 2 3 4 5 6 7 8 9 10 ...
  .. ..$ : int [1:4000] 4001 4002 4003 4004 4005 4006 4007 4008 4009 4010 ...
  .. ..$ : int [1:4000] 8001 8002 8003 8004 8005 8006 8007 8008 8009 8010 ...
  .. ..$ : int [1:4000] 16001 16002 16003 16004 16005 16006 16007 16008 16009 16010 ...
  .. ..@ ptype: int(0) 
  ..- attr(*, ".drop")= logi TRUE
f %>% 
  as.data.frame() %>% 
  mutate(
    across( c(action, contact, intention) , 
            as.character),
    action  = str_c("action = ",  action),
    contact = str_c("contact = ", contact)) %>% 
  count(action, contact, intention, .prediction) %>% 
  ggplot( aes(x=.prediction, y=n, fill=intention) )+
  geom_bar( stat="identity", position="dodge" ) +
  facet_grid(~action+contact) +
  labs( x="response", 
        y="count" )

Ordered categorical predictors

distinct(trolley, edu)
                   edu
1        Middle School
2    Bachelor's Degree
3         Some College
4      Master's Degree
5 High School Graduate
6      Graduate Degree
7     Some High School
8    Elementary School
trolley <-
  trolley %>% 
  mutate(edu_new = 
           recode(edu,
                  "Elementary School" = 1,
                  "Middle School" = 2,
                  "Some High School" = 3,
                  "High School Graduate" = 4,
                  "Some College" = 5, 
                  "Bachelor's Degree" = 6,
                  "Master's Degree" = 7,
                  "Graduate Degree" = 8) %>% 
           as.integer())

To incorporate this variable as a MONOTONIC CATEGORICAL PREDICTOR, we’ll set up a notation such that each step up in value comes with its own incremental (marginal) effect on the outcome.

\[ \phi_i = \sum_{j=0}^{E_i - 1} \delta_j \]

We only need 7 \(\delta\) parameters for the 8 education levels because the first level is absorbed into the intercept for \(\phi\) (by convention, \(\delta_0 = 0\)). So our full formula for this parameter is: \[ \phi_i = \beta_1\text{action}_i + \beta_2\text{contact}_i + \beta_3\text{intention}_i + \beta_E\sum_{j=0}^{E_i-1} \delta_j \] where \(\beta_E\) is the maximum effect of education – the \(\delta_j\) form a simplex (they sum to 1), so each \(\delta_j\) is the fraction of that maximum effect contributed by the step up to level \(j + 1\).

\[\begin{align*} \text{response}_i &\sim \text{Categorical}(\mathbf{p}) \\ \text{logit}(p_k) &= \alpha_k - \phi_i \\ \phi_i &= \beta_1\text{action}_i + \beta_2\text{contact}_i + \beta_3\text{intention}_i + \beta_4 \text{ mo(edu\_new}_i, \boldsymbol{\delta}) \\ \alpha_k &\sim \text{Normal}(0, 1.5) \\ \beta_1, \ldots, \beta_3 &\sim \text{Normal}(0, 1) \\ \beta_4 &\sim \text{Normal}(0, 0.143) \\ \boldsymbol{\delta} &\sim \text{Dirichlet}(2, 2, 2, 2, 2, 2, 2), \end{align*}\]
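
Before fitting, here’s the `mo()` arithmetic in isolation. The simplex and the maximum-education effect below are made-up numbers purely for illustration; the model estimates \(\boldsymbol{\delta}\) and \(\beta_4\) from the data.

```r
delta <- c(.10, .20, .05, .15, .25, .10, .15)  # hypothetical simplex; sums to 1
b_edu <- -0.05                                 # hypothetical maximum-education effect

# contribution of education to phi at each of the 8 levels:
# level 1 contributes 0; level 8 contributes the full b_edu
edu_effect <- b_edu * c(0, cumsum(delta))
round(edu_effect, 4)
```

Because the \(\delta_j\) are positive and sum to 1, the effect changes monotonically across levels and reaches exactly \(\beta_E\) at the top level.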

m3 <- 
  brm(data = trolley, 
      family = cumulative,
      response ~ 1 + action + contact + intention + mo(edu_new),  # note the `mo()` syntax
      prior = c(prior(normal(0, 1.5), class = Intercept),
                prior(normal(0, 1), class = b),
                # note the new kinds of prior statements
                prior(normal(0, 0.143), class = b, coef = moedu_new),
                prior(dirichlet(2, 2, 2, 2, 2, 2, 2), class = simo, coef = moedu_new1)),
      iter = 2000, warmup = 1000, cores = 4, chains = 4, 
      seed = 12,
      file = here("files/data/generated_data/m61.3"))
m3
 Family: cumulative 
  Links: mu = logit; disc = identity 
Formula: response ~ 1 + action + contact + intention + mo(edu_new) 
   Data: trolley (Number of observations: 9930) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept[1]    -3.13      0.17    -3.53    -2.84 1.00     1707     2012
Intercept[2]    -2.45      0.17    -2.83    -2.16 1.00     1702     1934
Intercept[3]    -1.87      0.17    -2.25    -1.58 1.00     1708     1804
Intercept[4]    -0.85      0.17    -1.23    -0.56 1.00     1731     1956
Intercept[5]    -0.18      0.17    -0.56     0.11 1.00     1718     1925
Intercept[6]     0.73      0.17     0.35     1.01 1.00     1737     1988
action          -0.71      0.04    -0.78    -0.63 1.00     4604     3182
contact         -0.96      0.05    -1.06    -0.86 1.00     4521     2761
intention       -0.72      0.04    -0.79    -0.65 1.00     4514     2992
moedu_new       -0.05      0.03    -0.11    -0.00 1.00     1705     1866

Monotonic Simplex Parameters:
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
moedu_new1[1]     0.26      0.15     0.04     0.59 1.00     2292     3104
moedu_new1[2]     0.14      0.09     0.02     0.35 1.00     4405     2501
moedu_new1[3]     0.19      0.11     0.03     0.44 1.00     4005     2613
moedu_new1[4]     0.16      0.09     0.03     0.39 1.00     3636     2564
moedu_new1[5]     0.04      0.05     0.00     0.15 1.00     2606     1894
moedu_new1[6]     0.09      0.06     0.01     0.25 1.00     3747     2484
moedu_new1[7]     0.12      0.07     0.02     0.29 1.00     3829     2596

Further Distributional Parameters:
     Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
disc     1.00      0.00     1.00     1.00   NA       NA       NA

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Let’s see the effect of education after controlling for story parameters.

nd <- distinct(
  trolley, 
  action, contact, intention, edu_new
)
predicted_draws(m3, nd) %>% 
  ungroup() %>% 
  count(edu_new, .prediction) %>% 
  with_groups(edu_new, mutate, total=sum(n)) %>% 
  mutate(pk = n/total,
         .prediction = as.numeric(.prediction),
         edu_new = as.factor(edu_new)) %>% 
  ggplot(aes(x = .prediction,  y = pk, color = edu_new)) +
  geom_point() +
  geom_line() +
  labs( x="response", 
        y="probability" ) +
  theme(legend.position = "top")